97 research outputs found

    High Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints

    An essential component of genome function is the syntax of genomic regulatory elements that determine how diverse transcription factors interact to orchestrate a program of regulatory control. A precise characterization of in vivo spacing constraints between key transcription factors would reveal key aspects of this genomic regulatory language. To discover novel transcription factor spatial binding constraints in vivo, we developed a new integrative computational method, genome wide event finding and motif discovery (GEM). GEM resolves ChIP data into explanatory motifs and binding events at high spatial resolution by linking binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence. GEM analysis of 63 transcription factors in 214 ENCODE human ChIP-Seq experiments recovers more known factor motifs than other contemporary methods, and discovers six new motifs for factors with unknown binding specificity. GEM's adaptive learning of binding-event read distributions allows it to further improve upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and discovery of closely spaced binding events of the same factor. In a systematic analysis of in vivo sequence-specific transcription factor binding using GEM, we have found hundreds of spatial binding constraints between factors. GEM found 37 examples of factor binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4 and other key regulatory factors. In human ENCODE data, GEM found 390 examples of spatially constrained pair-wise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1. 
    The discovery of new factor-factor spatial constraints in ChIP data is significant because it proposes testable models for regulatory factor interactions that will help elucidate genome function and the implementation of combinatorial control.
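
As a rough illustration of what a spatial binding constraint between two factors looks like in data, the sketch below tallies the signed offsets between binding events of two factors on one chromosome. It is not GEM's method: the function name `pairwise_offsets`, the `max_distance` window, and the toy coordinates are all assumptions made for this example. A strongly over-represented offset would hint at a distance-specific constraint.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def pairwise_offsets(positions_a, positions_b, max_distance=100):
    """Count signed offsets (b - a) between events of two factors.

    positions_a, positions_b: binding-event coordinates on one chromosome.
    max_distance: hypothetical window; pairs farther apart are ignored.
    """
    sorted_b = sorted(positions_b)
    offsets = Counter()
    for pos_a in positions_a:
        # Binary search restricts the scan to events inside the window.
        lo = bisect_left(sorted_b, pos_a - max_distance)
        hi = bisect_right(sorted_b, pos_a + max_distance)
        for pos_b in sorted_b[lo:hi]:
            offsets[pos_b - pos_a] += 1
    return offsets


# Toy coordinates (invented): factor B sits 12 bp downstream of factor A.
events_a = [100, 500, 900]
events_b = [112, 512, 912, 2000]
histogram = pairwise_offsets(events_a, events_b)
print(histogram.most_common(1))
```

In practice the offset histogram would need to be compared against a background expectation before calling any spacing significant; that step is omitted here.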

    STAMP: a web tool for exploring DNA-binding motif similarities

    STAMP is a newly developed web server that is designed to support the study of DNA-binding motifs. STAMP may be used to query motifs against databases of known motifs; the software aligns input motifs against the chosen database (or alternatively against a user-provided dataset), and lists of the highest-scoring matches are returned. Such similarity-search functionality is expected to facilitate the identification of transcription factors that potentially interact with newly discovered motifs. STAMP also automatically builds multiple alignments, familial binding profiles and similarity trees when more than one motif is inputted. These functions are expected to enable evolutionary studies on sets of related motifs and fixed-order regulatory modules, as well as illustrating similarities and redundancies within the input motif collection. STAMP is a highly flexible alignment platform, allowing users to ‘mix-and-match’ between various implemented comparison metrics, alignment methods (local or global, gapped or ungapped), multiple alignment strategies and tree-building methods. Motifs may be inputted as frequency matrices (in many of the commonly used formats), consensus sequences, or alignments of known binding sites. STAMP also directly accepts the output files from 12 supported motif-finders, enabling quick interpretation of motif-discovery analyses. STAMP is available at http://www.benoslab.pitt.edu/stam

    DNA Familial Binding Profiles Made Easy: Comparison of Various Motif Alignment and Clustering Strategies

    Transcription factor (TF) proteins recognize a small number of DNA sequences with high specificity and control the expression of neighbouring genes. The evolution of TF binding preference has been the subject of a number of recent studies, in which generalized binding profiles have been introduced and used to improve the prediction of new target sites. Generalized profiles are generated by aligning and merging the individual profiles of related TFs. However, the distance metrics and alignment algorithms used to compare the binding profiles have not yet been fully explored or optimized. As a result, binding profiles depend on TF structural information and sometimes may ignore important distinctions between subfamilies. Prediction of the identity or the structural class of a protein that binds to a given DNA pattern will enhance the analysis of microarray and ChIP–chip data where frequently multiple putative targets of usually unknown TFs are predicted. Various comparison metrics and alignment algorithms are evaluated (a total of 105 combinations). We find that local alignments are generally better than global alignments at detecting eukaryotic DNA motif similarities, especially when combined with the sum of squared distances or Pearson's correlation coefficient comparison metrics. In addition, multiple-alignment strategies for binding profiles and tree-building methods are tested for their efficiency in constructing generalized binding models. A new method for automatic determination of the optimal number of clusters is developed and applied in the construction of a new set of familial binding profiles which improves upon TF classification accuracy. A software tool, STAMP, is developed to host all tested methods and make them publicly available. This work provides a high quality reference set of familial binding profiles and the first comprehensive platform for analysis of DNA profiles. 
    Detecting similarities between DNA motifs is a key step in the comparative study of transcriptional regulation, and the work presented here will form the basis for tool and method development for future transcriptional modeling studies.
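
The two column comparison metrics named in the abstract, Pearson's correlation coefficient and the sum of squared distances, can be sketched for motifs stored as lists of base-frequency columns in A, C, G, T order. The ungapped local alignment below is a minimal illustration using Pearson's correlation as the column score. It is not STAMP's implementation, and all function names and the toy motifs are assumptions for this example.

```python
import math


def pearson(col_a, col_b):
    """Pearson correlation between two base-frequency columns."""
    mean_a = sum(col_a) / len(col_a)
    mean_b = sum(col_b) / len(col_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a uniform column is scored as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def sum_squared_distance(col_a, col_b):
    """Sum of squared differences between two columns (0 means identical)."""
    return sum((a - b) ** 2 for a, b in zip(col_a, col_b))


def best_ungapped_local_alignment(motif_a, motif_b):
    """Return (score, start_a, start_b, length) of the best ungapped local hit.

    Every diagonal offset is scanned, and the best-scoring contiguous run
    of column pairs on each diagonal is found with a running sum that is
    reset whenever it drops to zero or below.
    """
    best = (0.0, 0, 0, 0)
    for offset in range(-(len(motif_b) - 1), len(motif_a)):
        i, j = max(offset, 0), max(-offset, 0)
        run_score, run_len = 0.0, 0
        while i < len(motif_a) and j < len(motif_b):
            run_score += pearson(motif_a[i], motif_b[j])
            run_len += 1
            if run_score <= 0:
                run_score, run_len = 0.0, 0
            elif run_score > best[0]:
                best = (run_score, i - run_len + 1, j - run_len + 1, run_len)
            i += 1
            j += 1
    return best


# Toy motifs (invented): "ACG" and "TAC" share the two-column core "AC".
motif_acg = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
motif_tac = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
score, start_a, start_b, length = best_ungapped_local_alignment(motif_acg, motif_tac)
print(score, start_a, start_b, length)
```

Pearson's correlation is used as the alignment score here because it is negative for dissimilar columns, which is what lets a local alignment terminate. The sum of squared distances is a distance rather than a similarity, so it would need to be converted before being used the same way.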

    Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models

    Background: Many current gene prediction methods use only one model to represent protein-coding regions in a genome, and so are less likely to predict the location of genes that have an atypical sequence composition. It is likely that future improvements in gene finding will involve the development of methods that can adequately deal with intra-genomic compositional variation. Results: This work explores a new approach to gene prediction, based on the Self-Organizing Map, which has the ability to automatically identify multiple gene models within a genome. The current implementation, named RescueNet, uses relative synonymous codon usage as the indicator of protein-coding potential. Conclusions: While its raw accuracy rate can be lower than that of other methods, RescueNet consistently identifies some genes that other methods do not, and should therefore be of interest to gene-prediction software developers and genome annotation teams alike. RescueNet is recommended for use in conjunction with, or as a complement to, other gene prediction methods.
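
Relative synonymous codon usage (RSCU), the feature the abstract names, is conventionally defined as a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below computes it for a single in-frame coding sequence using the standard genetic code. It only illustrates the feature, not RescueNet or its Self-Organizing Map; the function name `rscu` and the toy sequence are assumptions for this example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stops).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group sense codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS[aa].append(codon)


def rscu(sequence):
    """Return {codon: RSCU} for an in-frame coding sequence.

    Amino acids that never occur in the sequence are skipped, and any
    trailing partial codon or codon with ambiguous bases is ignored.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for aa, codons in SYNONYMS.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        expected = total / len(codons)  # count under equal codon usage
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy sequence (invented): Ala-Ala-Ala-Lys encoded as GCT GCT GCC AAA.
usage = rscu("GCTGCTGCCAAA")
print(usage["GCT"], usage["GCC"], usage["AAA"])
```

An RSCU above 1 means a codon is used more often than expected under equal usage, and a value below 1 means it is used less often. A vector of these values per sequence window is the kind of input a clustering method could consume.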

    Regulatory conservation of protein coding and microRNA genes in vertebrates: lessons from the opossum genome

    BACKGROUND: Being the first noneutherian mammal sequenced, Monodelphis domestica (opossum) offers great potential for enhancing our understanding of the evolutionary processes that take place in mammals. This study focuses on the evolutionary relationships between conservation of noncoding sequences, cis-regulatory elements, and biologic functions of regulated genes in opossum and eight vertebrate species. RESULTS: Analysis of 145 intergenic microRNA and all protein coding genes revealed that the upstream sequences of the former are up to twice as conserved as the latter among mammals, except in the first 500 base pairs, where the conservation is similar. Comparison of promoter conservation in 513 protein coding genes and related transcription factor binding sites (TFBSs) showed that 41% of the known human TFBSs are located in the 6.7% of promoter regions that are conserved between human and opossum. Some core biologic processes exhibited significantly fewer conserved TFBSs in human-opossum comparisons, suggesting greater functional divergence. A new measure of efficiency in multigenome phylogenetic footprinting (base regulatory potential rate [BRPR]) shows that including human-opossum conservation increases specificity in finding human TFBSs. CONCLUSION: Opossum facilitates better estimation of promoter conservation and TFBS turnover among mammals. The fact that substantial TFBS numbers are located in a small proportion of the human-opossum conserved sequences emphasizes the importance of marsupial genomes for phylogenetic footprinting-based motif discovery strategies. The BRPR measure is expected to help select genome combinations for optimal performance of these algorithms. Finally, although the etiology of the microRNA upstream increased conservation remains unknown, it is expected to have strong implications for our understanding of regulation of their expression.

    A multi-parametric flow cytometric assay to analyze DNA–protein interactions

    Interactions between DNA and transcription factors (TFs) guide cellular function and development, yet the complexities of gene regulation are still far from being understood. Such understanding is limited by a paucity of techniques with which to probe DNA–protein interactions. We have devised magnetic protein immobilization on enhancer DNA (MagPIE), a simple, rapid, multi-parametric assay using flow cytometric immunofluorescence to reveal interactions among TFs, chromatin structure and DNA. In MagPIE, synthesized DNA is bound to magnetic beads, which are then incubated with nuclear lysate, permitting sequence-specific binding by TFs, histones and methylation by native lysate factors that can be optionally inhibited with small molecules. Lysate protein–DNA binding is monitored by flow cytometric immunofluorescence, which allows for accurate comparative measurement of TF-DNA affinity. Combinatorial fluorescent staining allows simultaneous analysis of sequence-specific TF-DNA interaction and chromatin modification. MagPIE provides a simple and robust method to analyze complex epigenetic interactions in vitro.

    A Cdx4-Sall4 Regulatory Module Controls the Transition from Mesoderm Formation to Embryonic Hematopoiesis

    Summary: Deletion of caudal/cdx genes alters hox gene expression and causes defects in posterior tissues and hematopoiesis. Yet, the defects in hox gene expression only partially explain these phenotypes. To gain deeper insight into Cdx4 function, we performed chromatin immunoprecipitation sequencing (ChIP-seq) combined with gene-expression profiling in zebrafish, and identified the transcription factor spalt-like 4 (sall4) as a Cdx4 target. ChIP-seq revealed that Sall4 bound to its own gene locus and the cdx4 locus. Expression profiling showed that Cdx4 and Sall4 coregulate genes that initiate hematopoiesis, such as hox, scl, and lmo2. Combined cdx4/sall4 gene knockdown impaired erythropoiesis, and overexpression of the Cdx4 and Sall4 target genes scl and lmo2 together rescued the erythroid program. These findings suggest that auto- and cross-regulation of Cdx4 and Sall4 establish a stable molecular circuit in the mesoderm that facilitates the activation of the blood-specific program as development proceeds.

    An Integrated Model of Multiple-Condition ChIP-Seq Data Reveals Predeterminants of Cdx2 Binding

    Regulatory proteins can bind to different sets of genomic targets in various cell types or conditions. To reliably characterize such condition-specific regulatory binding we introduce MultiGPS, an integrated machine learning approach for the analysis of multiple related ChIP-seq experiments. MultiGPS is based on a generalized Expectation Maximization framework that shares information across multiple experiments for binding event discovery. We demonstrate that our framework enables the simultaneous modeling of sparse condition-specific binding changes, sequence dependence, and replicate-specific noise sources. MultiGPS encourages consistency in reported binding event locations across multiple-condition ChIP-seq datasets and provides accurate estimation of ChIP enrichment levels at each event. MultiGPS's multi-experiment modeling approach thus provides a reliable platform for detecting differential binding enrichment across experimental conditions. We demonstrate the advantages of MultiGPS with an analysis of Cdx2 binding in three distinct developmental contexts. By accurately characterizing condition-specific Cdx2 binding, MultiGPS enables novel insight into the mechanistic basis of Cdx2 site selectivity. Specifically, the condition-specific Cdx2 sites characterized by MultiGPS are highly associated with pre-existing genomic context, suggesting that such sites are pre-determined by cell-specific regulatory architecture. However, MultiGPS-defined condition-independent sites are not predicted by pre-existing regulatory signals, suggesting that Cdx2 can bind to a subset of locations regardless of genomic environment. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5. National Science Foundation (U.S.) (Graduate Research Fellowship under Grant 0645960); National Institutes of Health (U.S.) (grant P01 NS055923); Pennsylvania State University. Center for Eukaryotic Gene Regulatio